Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
aaskay
GitHub Repository: aaskay/Sloan-Digital-Sky-Survey
Path: blob/master/SDSS Classification.ipynb
54 views
Kernel: Python 3

<img src="https://trac.sdss.org/chrome/site/sdss.png", width=300, height=300>

I was searching for an astronomy dataset on Kaggle and came across this project (https://www.kaggle.com/lucidlenn/data-analysis-and-classification-using-xgboost). I'd like to give credit to Lennart Grosser for a couple of the plots that I use in this notebook.

Getting the data

I used the CasJobs website which offers a SQL-based interface to query the databases that contain the SDSS data. The CasJobs system allows you to save the data returned from the query in csv format.

For more information about how to get data from the SDSS see their Data Access Guide:

http://www.sdss.org/dr14/data_access/

I used the following query to retreive the data:

SELECT TOP 100000
p.objid,
p.ra,
p.dec,
p.cModelMag_u,
p.cModelMag_g,
p.cModelMag_r,
p.cModelMag_i,
p.cModelMag_z,
p.modelMag_u,
p.modelMag_g,
p.modelMag_r,
p.modelMag_i,
p.modelMag_z,
p.psfMag_u,
p.psfMag_g,
p.psfMag_r,
p.psfMag_i,
p.psfMag_z,
p.petroMag_u,
p.petroMag_g,
p.petroMag_r,
p.petroMag_i,
p.petroMag_z,
s.specobjid,
s.class,
s.z as redshift
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE
p.cModelMagErr_u BETWEEN 0 AND 0.5
AND g BETWEEN 0 AND 20

%matplotlib inline import numpy as np import pandas as pd import matplotlib import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.model_selection import GridSearchCV from sklearn.model_selection import RandomizedSearchCV from sklearn.model_selection import cross_val_score from sklearn.model_selection import KFold from sklearn.metrics import accuracy_score from sklearn.feature_selection import RFE from sklearn import preprocessing from sklearn.preprocessing import LabelEncoder #import xgboost as xgb import seaborn as sns from xgboost.sklearn import XGBClassifier from sklearn import metrics #Additional scklearn functions sns.set_style('whitegrid')

Import the data

Define the columns that we'll use and import the data from the csv file. The cModelMag filter columns were chosen based on information at http://www.sdss3.org/dr8/algorithms/magnitudes.php#cmodel. Specifically, under the section titled 'Which Magnitude Should I Use?', we have the following: "...the cmodel magnitude is now an adequate proxy to use as a universal magnitude for all types of objects".

myFeatures = ['ra','dec','cModelMag_u','cModelMag_g','cModelMag_r','cModelMag_i','cModelMag_z','redshift'] myTargets = ['class'] sdss_data = pd.read_csv('SDSS_Small.csv',usecols=myFeatures+myTargets)
sdss_data.describe()
sdss_data.shape
(100000, 9)
sdss_data.head()

Get a count of each type of class we are working to predict.

sdss_data['class'].value_counts()
GALAXY 52401 STAR 30367 QSO 17232 Name: class, dtype: int64

We see that our data is not balanced between the classes.

sdss_features = sdss_data[myFeatures] sdss_targets = sdss_data[myTargets]
sdss_features.head()
sns.pairplot(x_vars='ra', y_vars='dec', data=sdss_data, hue='class',palette="Set1", size=5, aspect=2,markers=["o", "s", "D"],plot_kws={'alpha': 0.25}) plt.title('Equatorial coordinates')
<matplotlib.text.Text at 0x198b02133c8>
Image in a Jupyter notebook

Looking at the above plot of right ascention and declination for each of our class types, we see that there is no clear seperation in the positions of our objects. We can conclude that these features will not add any real predictive value to our model and therefore we can feel confident in dropping them from our dataset.

sdss_features = sdss_features.drop(['ra','dec'],axis=1)
sdss_features.head()

Now are dataset is only in terms of frequency filters and redshift.

Redshift

Looking at the redshift for each class of object, we see that the further away an object is the more redshifted it appears. From these graphs, it looks like the redshift will be an important feature in classifying observations.

fig, axes = plt.subplots(nrows=1, ncols=3,figsize=(16, 4)) ax0, ax1, ax2 = axes.flatten() n_bins = 20 x = sdss_data[sdss_data['class']=='STAR'].redshift y = sdss_data[sdss_data['class']=='GALAXY'].redshift z = sdss_data[sdss_data['class']=='QSO'].redshift ax0.hist(x, bins = n_bins) ax0.set_title('Star') ax1.hist(y, bins = n_bins) ax1.set_title('Galaxy') ax2.hist(z, bins = n_bins) ax2.set_title('QSO') fig.tight_layout() plt.show()
Image in a Jupyter notebook

The plots above confirm our intuition that the further an object is from the observer, the more redshifted it will be. Redshift will be a good predictor value but unfortunately, there is a bit of overlap between the reshift of some galaxies and some QSO's.

Feature Corrolation Plot:

sns.set(style="white") # Compute the correlation matrix corr = sdss_features.corr() # Set up the matplotlib figure fig, axes = plt.subplots(figsize=(8, 8)) # Generate a custom diverging colormap cmap = sns.diverging_palette(220, 10, as_cmap=True) # Draw the heatmap with the mask and correct aspect ratio sns.heatmap(corr, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}) plt.yticks(rotation=0) plt.xticks(rotation=45) plt.show()
Image in a Jupyter notebook

We can see some correlation between the cModelMagnitudes for the u,g, and r filters as well as between the i and z filters. There also seems to be some correlation between the cModelMagnitudes for the u,g, and r filters and redshift.

sdss_standardized_features = preprocessing.StandardScaler().fit(sdss_features).transform(sdss_features) features = np.array(sdss_standardized_features) targets = np.array(sdss_targets)

Split the data into test and training subsets

training_data, test_data, training_targets, test_targets = train_test_split(features, targets, test_size=0.25, random_state=42)

Encode the targets. XGBoost cannot use string classes directly

# encode string class values as integers label_encoder = LabelEncoder() label_encoder = label_encoder.fit(training_targets.ravel()) training_targets = label_encoder.transform(training_targets.ravel()) label_encoder = label_encoder.fit(test_targets.ravel()) test_targets = label_encoder.transform(test_targets.ravel())

Define a basic XGBoost classifier

# max_delta_step set to one since we have an embalance in classes. Can help with convergence. xgb = XGBClassifier(max_delta_step=1,random_state=42)

Fitting an XGBoost model with default parameters

xgb.fit(training_data, training_targets)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=1, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='multi:softprob', random_state=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True, subsample=1)

Above we see the default parameters used by the classifier.

Make prediction and look at the accuracy

predictions = xgb.predict(test_data) xgb_accuracy = (predictions == test_targets).sum().astype(float) / len(predictions)*100 print("XGBoost's prediction accuracy is: %3.2f" % (xgb_accuracy))
XGBoost's prediction accuracy is: 97.64

Use sklearn's RandomizedSearchCV to try and tune some parameters

n_estimators = range(800,1600,100) max_tree_depth = [8] #range(3,9) eta = [0.1] #[0.01,0.05,0.1,0.5,1] #learning rate lmbda = [1,2,3,4] #[0.5,1,2] #L2 regularization term on weights min_child_wt = [1,2,3] gamma = [0,0.1,0.5] #[0,1,2,4] #Minimum loss reduction required to make a further partition on a leaf node of the tree.
xgbParams = { 'learning_rate':eta, 'gamma':gamma, 'n_estimators':n_estimators, 'reg_lambda':lmbda, 'max_depth':max_tree_depth, 'min_child_weight':min_child_wt }
clf = RandomizedSearchCV(xgb,xgbParams)
clf.fit(training_data, training_targets)
RandomizedSearchCV(cv=None, error_score='raise', estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=1, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='multi:softprob', random_state=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True, subsample=1), fit_params=None, iid=True, n_iter=10, n_jobs=1, param_distributions={'learning_rate': [0.1], 'gamma': [0, 0.1, 0.5], 'n_estimators': range(800, 1600, 100), 'reg_lambda': [1, 2, 3, 4], 'max_depth': [8], 'min_child_weight': [1, 2, 3]}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score=True, scoring=None, verbose=0)
clf_best_parameters = clf.best_estimator_ print(clf_best_parameters)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1, gamma=0.5, learning_rate=0.1, max_delta_step=1, max_depth=8, min_child_weight=1, missing=None, n_estimators=900, n_jobs=1, nthread=None, objective='multi:softprob', random_state=42, reg_alpha=0, reg_lambda=2, scale_pos_weight=1, seed=None, silent=True, subsample=1)
preds = clf.predict(test_data)
xgb_accuracy = (preds == test_targets).sum().astype(float) / len(preds)*100 print("XGBoost's prediction accuracy is: %3.2f" % (xgb_accuracy))
XGBoost's prediction accuracy is: 98.19

We achieve a slight improvement by searching over a range of parameter values to try to hypertune the learning parameters of the model.

The Confusion Matrix

from sklearn.metrics import confusion_matrix import itertools
cm = confusion_matrix(test_targets, preds) classes = label_encoder.inverse_transform(test_targets) classes = np.unique(classes)
plt.imshow(cm, interpolation='nearest', cmap=cmap) plt.title('Confusion matrix') plt.colorbar() tick_marks = np.arange(len(classes)) plt.xticks(tick_marks, classes, rotation=45) plt.yticks(tick_marks, classes) #fmt = '.2f' fmt = 'd' thresh = cm.max() / 2. for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black") plt.tight_layout() plt.ylabel('True label') plt.xlabel('Predicted label') plt.show()
Image in a Jupyter notebook

The results are in line with what we saw in the redshift plots for the different classes of objects. Galaxies and QSO's are much harder to differentiate.

We have used the XGBoost classifier to predict with 98 percent accuracy the class (type) of stellar object for a given SDSS observation. Further hypertuning of parameters might acheive a slight increase in accuracy but further research into the features available in the data would probably be the best way to increase accuracy of the model in its ability to differentiate galaxies from QSO's.